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Abstract 

In disease mapping, the aim is to estimate the spatial pattern in disease 
risk over an extended geographical region, so that areas with elevated risks 

Hcan be identified. A Bayesian hierarchical approach is typically used to 
produce such maps, which models the risk surface with a set of spatially 
\^ smooth random effects. However, in complex urban settings there are 

-4_J likely to be boundaries in the risk surface, which separate populations that 

are geographically adjacent but have very different risk profiles. Therefore 
r/} this paper proposes an approach for detecting such risk boundaries, and 

tests its effectiveness by simulation. Finally, the model is applied to lung 
I cancer incidence data in Greater Glasgow, Scotland, between 2001 and 

2005. 

Boundary detection; Disease mapping; Spatial correlation 
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1 Introduction 

OO 

Disease mapping is the area of statistics that quantifies the spatial pattern in 
t-H disease risk over an extended geographical region, such as a city of country. 

t-H The study region is partitioned into a number of small non-overlapping areal 

units, such as electoral wards, and only the total number of cases in each unit is 
available. The majority of disease risk maps are produced using Bayesian hier- 
archical models, utilising a vector of covariate risk factors and a set of random 
effects. The latter model any overdispersion or spatial correlation in the disease 
data (after the covariate effects have been allowed for), which can be caused by 
the presence of unmeasured risk factors that also have a spatial structure. The 
random effects are often represented by a Conditional Autoregressive (CAR) 
prior, which forces the effects in two areas to be correlated if those areas share 
a common border. 



- r— I 



The production of such maps provide a number of benefits to public health 
professionals, including the ability to investigate the association between dis- 
ease prevalence and suspected risk factors. More recently, disease maps have 



1 



been used to identify boundaries (or cliffs or discontinuities) in the risk sur- 
face, which separate geographically adjacent areas that have high and low risks. 
Such boundaries are likely to exist in complex urban settings, where rich and 
poor communities can be separated by just a few metres. The detection of such 
boundaries has a number of benefits, including the ability to detect the spatial 
extent of a cluster of high risk areas. It is also important economically, because 
it allows health resources to be targeted at communities with the highest dis- 
ease risks. Furthermore, the spatial units at which the health and covariate data 
are available are typically designed for administrative purposes, and thus often 
do not delineate between distinct neighbourhoods (i.e. groups of people with 
the same social circumstances, culture and behaviour). Therefore, boundaries 
in disease risk surfaces may also correspond to boundaries between different 
neighbourhoods, which are of interest because "their locations reflect underly- 



ing biological, physical, and/or social processes" ( Jacquez et al. (2000)) 



The statistical detection of boundaries in disease maps is also known as 
Wombling, following the seminal article by Womble (19511. More recently, a 



number of different approaches to this problem have been proposed, including 



the calculation of local statistics (Boots (2001 )), and a Bayesian random effects 



model (Lu and Carlin (20051). In this paper we also use a Bayesian random 
effects model, and identify boundaries by measuring the level of dissimilarity 
between the populations living in neighbouring areas. We believe that risk 
boundaries are more likely to occur between populations that are very differ- 
ent, because homogeneous populations should have similar disease risks. The 
remainder of this paper is organised as follows. Section 2 provides a review 
of disease mapping and boundary detection methods, while Section 3 presents 
our proposed methodological extension. Section 4 assesses the efficacy of our 
approach using simulation, while Section 5 identifies boundaries in the risk sur- 
face for lung cancer cases in Greater Glasgow. Finally, Section 6 contains a 
concluding discussion and outlines future developments. 



2 Background 

2.1 Disease mapping 

The data used to quantify disease risk are denoted by y = (j/i, . . . , y n ) and E = 
(Ei, . . . , E n ), the former being the numbers of disease cases observed in each of n 
non-overlapping areas within a specified time frame. The latter are the expected 
numbers of disease cases, which depend on the size and demographic structure 
of the population living within each area. Disease risk can be summarised by 
the standardised incidence ratio (SIR), which for area k is given by Rk = yk/Ek- 
Values above one represent areas with elevated risks of disease, while values less 
than one correspond to relatively healthy areas. However, elevated SIR values 
can occur by chance in areas where Ek is small, so the set of disease risks for all 
n areas are more commonly estimated using a Bayesian hierarchical model. A 
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general specification has been described by 



Elliott et al. (2000), Banerjee et al 



(2004), Wakefield (2007) and Lawson (2008), and is given by 



Yk\E k , R k 
ln(R k ) 



Poisson(i?/ £ i?/ £ ) 
<\>k- 



for k = 1, . 



. ,71, 



xJ/3 



(1) 



Disease risk is modelled by covariates = (x k i, . . . ,x kp ), and random 
effects <p = (0i, . . . , </> n ), the latter allowing for any overdispersion and spatial 
correlation in the disease data (after the covariate effects have been accounted 
for). The most common prior for 4> is a conditional autoregressive (CAR, Besag 
et al. ( 1991[ )) model, which is specified in terms of n univariate conditional 
distributions, f(<f>k\<Pi, ■ ■ ■ , <Pk-li <Pk+i, ■ ■ ■ , 4>n) for k = 1, . . . , n. However, each 
full conditional distribution only depends on the values of <pj in a small number 
of neighbouring areas. This neighbourhood information is contained in a binary 
adjacency matrix W, which has elements w k j that are equal to one or zero 
depending on whether areas (k,j) are defined to be neighbours. A common 
specification is that areas (k,j) are neighbours if they share a common border, 
which corresponds to w k j—l and is denoted by k ~ j. A number of priors have 
been proposed within the general class of CAR models, and the one we adopt 



here was originally proposed by Leroux et al 



reviewed by MacNab (20031 and Lee (2011). The model has full conditional 



(1999), and has subsequently been 



distributions given by 



<fe|0_ fe ,W,T 2 ,/j,/z~ N 



(2) 

The conditional expectation is a weighted average of the random effects in 
neighbouring areas and a global intercept p (not included in 0), where the 
weights are controlled by p. In this model p = corresponds to independence, 
while p close to one defines strong spatial correlation. These full conditional 
distributions correspond to a proper multivariate Gaussian distribution if p G 
[0, 1), which is given by 



~ N(/il, r 2 [pW* + (l-p)I}- 1 ), 



(3) 



where / denotes annxn identity matrix while 1 is an n- vector of ones. In the 
above equation W* has diagonal elements w* kk = Y17=i w kii an d non-diagonal 
elements w k j = —w k j. This model simplifies to the intrinsic autoregressive 
model if p is fixed at one, although this does correspond to an improper joint 
distribution for cf>. 



2.2 Boundary detection 



Lu and Carlfn] p005| propose the use of Boundary Likelihood Values (BLV), 



which are calculated as BLVfcj = \R k — Rj\, the absolute difference in risk 
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between two neighbouring areas. The border between neighbouring areas (k, j) 
can then be classified as a boundary in the risk surface if either 



(a) BLVfcj > ci for some cut-off c%; or 



(b) BLV/cj is within the top Ci% of the boundary likelihood values over the 
study region, for some percentage c-i- 

These approaches to boundary detection are ad-hoc, because the decision 



rules (a) and (b) require tuning constants (ci or C2) to be specified. Jacquez 
et al. ( 2000[ ) has criticised this approach for this reason, and argues that by 
specifying the tuning constant the investigator essentially chooses the number 
of boundaries that are identified, even though this is unknown and the goal of 
the analysis. 



3 Methods 

The approach we propose allows the data to determine the number and locations 
of any boundaries in the risk surface, rather than requiring the investigator to 
specify a tuning constant. We achieve this by modelling Wf.j as a binary ran- 
dom quantity if areas (k, j) share a common border, rather than assuming it 
is fixed at one. If Wkj is estimated as zero the random effects in areas (k,j) 
are conditionally independent, which corresponds to a boundary in the risk sur- 
face. In contrast, if u>kj equals one the random effects are correlated, which 
corresponds to no boundary. This general approach to boundary detection has 



previously been proposed by Lu et al. (2007), Ma and Carlin (20071, and Ma 



et al. (2010), who model the set of Wkj by logistic regression, a CAR prior, 
or an Ising model. However, this requires the large set of Wkj for all pairs of 
neighbouring areas to be estimated, which Li et al. (2011) argue are not well 



identified from the data. In this paper we model the set of w^j as a function of 
parameters a. = (ax, . . . , a q ), rather than treating each Wkj as a separate un- 
known quantity. This results in a parsimonious yet flexible model for detecting 
boundaries in the risk surface, which avoids the weak parameter identifiability 



and slow MCMC convergence experienced by Li et al. (20111, when modelling 
each Wkj separately. 



3.1 Level 1 - Observation model 

The first stage of our hierarchical model is similar to that described in Section 
2, and is given by 



Yk\Ek, Rk 
ln(R k ) 

: \cf>_ k ,n,a,T 2 



Poisson(iJfc-Rfc) for k = 1, 



,n, 



N 



(4) 



'0.99X;"=i^fej(a)^+0.01^ r 2 

0-99 E"=i w kj (a) +0.01 ' 0.99X;™ = i ^fcj(a) +0.01 
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In common with Lu et al. (2007) no covariates are included in because 
<fi would then represent the residual pattern in disease risk, and any boundaries 
identified would hence be in the residual surface. Secondly, we fix p at 0.99 be- 
cause it enforces strong spatial correlation, which allows the presence or absence 
of boundaries to be determined by W{a). We note that we do not choose p = 1, 
as it results in an infinite mean and variance in the conditional distribution of 
4>k{see |2j)) if an area is surrounded by boundaries, i.e. if J2j=i w kj{ a ) = for 
some area k. 



3.2 Level 2 - Neighbourhood model 

We believe that boundaries in the risk surface are likely to occur between pop- 
ulations that are very different, because homogeneous populations should have 
similar risk profiles. Therefore, we model the presence or absence of a bound- 
ary between areas by a vector of q non-negative dissimilarity metrics 
Zfcji = {zkn, ■ ■ ■ , Zkjq)- These metrics have the general form 

\Zki Zji If ■ -, / r \ 

Zkji = — for i = 1, . . . , q, (5) 

the absolute difference in the value of a covariate between the two areas in 
question. Here, <7j represents the standard deviation of \zki~ Zji \ over all pairs of 
contiguous areas, and we re-scale the dissimilarity metrics to improve the mixing 
and convergence of the MCMC algorithm. It is these dissimilarity measures that 
drive the detection of boundaries in the risk surface, and examples could include 
differences in the population's social characteristics (e.g. average income) or risk 
inducing behaviour (e.g. smoking prevalence). Using these metrics we model 
the elements of W(ct) as 



■*<«> = {I a 3 "*"""""' • < 6 > 

where pairs of areas that do not share a common border have Wkj(ot) fixed 
at zero. For areas that are contiguous, the model detects a boundary in the risk 
surface if 

cxp(— Yll—i ZkyiOLi) is less than 0.5. Therefore we constrain the regression pa- 
rameters to be non-negative, so that the greater the dissimilarity between two 
areas the more likely there is to be a boundary between them. In contrast, if 
two areas have identical covariate values (and hence homogeneous populations) 
there cannot be a boundary between them, regardless of the value of a. This is 
the reason we do not include an intercept term in ([6]), as doing so would allow 
boundaries to be detected between areas with homogeneous populations. The 
regression parameters determine the number of risk boundaries in the study 
region, with larger values of a corresponding to more boundaries being de- 
tected. If only one dissimilarity metric z is included in ([6]), then a plausible 
range of values for the single regression parameter a can be determined. At 
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one extreme, no boundaries will be detected if a < — ln(0.5)/ z max , while at the 
other, all borders in the study region will be considered as boundaries (unless 
Zkj=0) if a > — ln(0.5)/z mm . Here (z mm , z max ) denote the minimum positive 
and maximum values of the dissimilarity metric. More generally, if there are q 
dissimilarity metrics then boundaries are identified if 

exp(— z/yiai) x ... x exp(-z kjq a q ) < 0.5, 

where the value of each component, exp(— ZkjiOLi), must lie between zero 
and one. Therefore, if a,; < — \xl(0 .5) / z™ ax , the dissimilarity measure z^ji is not 
solely responsible for detecting any boundaries, because exp(— ZiCti) would be 
greater than 0.5 for all pairs of contiguous areas. Therefore in terms of interpre- 
tation, the dissimilarity metric can be said to have no effect on detecting bound- 
aries if the entire 95% credible interval for at is less than a m i n — — ]n(0.5)/z%j- x . 
In contrast, if the interval lies completely above a m i n , then the metric can be 
said to have a substantial effect on identifying risk boundaries. We note that 
the usual statistical representation of 'no effect' (credible interval that includes 
zero) is not possible in this context, because the regression parameters are con- 
strained to be non-negative. We also note that the approach we outline does not 
guarantee that the boundaries we detect will be closed (form an unbroken line, 
an example of which is shown in Figure [3]), which allows us to detect bound- 
aries that enclose an entire subregion, as well as those that just separate highly 
different areas. 



3.3 Level 3 - Hyperpriors 

The vector of random effects depend on hyperparameters (/i, r 2 ,o:), which re- 
spectively control its mean, variance and correlation structure. The mean \x 
is assigned a weakly informative Gaussian prior distribution, with a mean of 
zero and a variance of 10. The variance parameter r 2 is assigned a weakly 
informative Uniform(0, 10) prior on the standard deviation scale, following the 



suggestion of Gelman ( 2006 ) . We adopt a uniform prior for the regression pa- 
rameters, (Xi ~ Uniform(0, Mi), which corresponds to our prior ignorance about 
the number of boundaries in the risk surface. An alternative would be a recip- 
rocal prior, f(cti) oc ~J[0 < a* < Mi], which represents our prior belief that 
the risk surface is spatially smooth. In both cases a natural upper limit would 
be Mi — — m(0.5)/£K|™, the value at which the dissimilarity measure Zkji solely 
identifies all borders as boundaries in the risk surface. However, in a bound- 
ary detection analysis one is looking to identify boundaries between collections 
of areas, which have similar risks within each collection but differ across the 
boundary. Therefore we fix Mj so that at most 50% of borders can be classified 
as boundaries, and present a sensitivity analysis in the supplementary material 
to this choice of 50%. 
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4 Simulation study 



In this section we present a simulation study, that assesses the accuracy with 
which our proposed model can detect boundaries in the risk surface. In doing 
this we assess whether our model can detect 'true' boundaries in the risk surface, 
as well as the extent to which it falsely identifies boundaries that do not exist. 
We have decided not to compare our model to the existing boundary detection 
approach using 0, ([2} and BLVs, because this requires a tuning constant (ci or 
C2) to be specified by the user, which in real life would be unknown. However, 
these tuning constants are known in this simulation setting, and specifying their 
value would put this method at an unfair advantage or disadvantage, depending 
on whether we chose the 'correct' values. 



4.1 Data generation 

We base our study on the n — 271 areas that comprise the Greater Glasgow and 
Clyde health board, which is the region considered in the cancer mapping study 
presented in Section five. Disease counts are generated from the Poisson model 
Q, where the expected numbers of admissions, E, relate to the Glasgow cancer 
data. A new risk surface R = exp(</>) is generated for each set of simulated 
disease data, because this ensures the results are not affected by a particular 
realisation of <fi. Each simulated risk surface has fixed boundaries, which are 
shown by the bold black lines in Figure [TJ There are 74 boundaries in total, 
which corresponds to approximately 10% of the set of borders in the study 
region. This set of boundaries partition the study region into 6 groups, the 
main area shaded in white, and the remaining 5 smaller areas shaded in grey. 
To produce risk surfaces with boundaries, the random effects (</>) are generated 
from a multivariate Gaussian distribution with a piecewise constant mean, which 
in the white region is equal to 0, while in the grey regions it is equal to ki- The 
correlation function is from the Matern class with smoothness parameter equal 
to k — 2.5, while the spatial range is fixed so that the median correlation 
between areas is 0.5. The dissimilarity metrics are generated from 

J |N(1,0.5 2 )| if areas (k,j) are not separated by a boundary. 
fe J y |N(1 + fc 2i 0.5 2 )| if areas (fc, j) are separated by a boundary. 

(7) 

Here, larger values of ki correspond to dissimilarity metrics that better iden- 
tify the true boundaries in the risk surface; i.e. have larger values for the bound- 
aries in Figure [I] than for the non-boundaries. We assess the effect on model 
performance of changing both the size of the boundaries (via k\ ) and the quality 
of the dissimilarity metrics (via fc 2 ), the results of which are displayed in Table 
[T] When assessing the effect of boundary size we fix ki = 3, which provides 
nearly 'perfect' dissimilarity metrics, i.e. values for Zkj for boundaries and non- 
boundaries that almost never overlap. Conversely, when assessing the effect of 
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Figure 1: Locations of the true boundaries in the simulated risk surfaces. 








having imperfect dissimilarity metrics, we fix the boundary size at k\ = 0.4, 
which provides fairly large boundaries to identify. 

4.2 Results 

The top half of Table [l] displays the effects of changing the magnitude of the 
risk boundaries, in the idealised situation of having perfect dissimilarity metrics. 
The constant k\ represents the difference in the mean value of the risk surface 
between white and grey areas (see Figure [T]), hence smaller values correspond to 
a spatially smoother surface [k\ = would correspond to a completely smooth 
risk surface with no boundaries). The table shows that the model can detect 
larger risk boundaries more often than smaller ones as expected, although it 
still achieves over a 90% detection rate when the risk surface only has a mean 
difference of 0.2. The much lower percentages for smaller values of k\ are also 
not surprising, as they correspond to situations in which the average size of the 
true boundaries are not very different to the average size of the non-boundaries. 
In addition, the false positive rates are very low (generally less than 1%) regard- 
less of the boundary size, which suggests that detected boundaries are likely to 
be real. 

The bottom half of Table [T] displays the effects of having imperfect dissimilar- 
ity metrics, i.e. metrics for which some of the values corresponding to the 'true' 
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Table 1: The effect of boundary size (as measured by fci) and the quality of 
the dissimilarity metrics (as measured by k 2 ) on the effectiveness of the model. 
The table displays the percentage agreement for boundaries (BA) and non- 
boundaries (NBA), as well as the bias and root mean square error (RMSE) of 
the estimated risk surface, which are presented as a percentage of their true 
valuer 



Comparison 


ki 


k 2 


BA (%) NBA (%) 


Bias 


RMSE 




0.4 


3 


99.97 


98.70 


-0.123 


5.689 


Boundary 


0.3 


3 


99.57 


99.16 


-0.195 


5.700 


Size 


0.2 


3 


93.76 


99.43 


-0.187 


5.689 




0.1 


3 


48.31 


99.89 


-0.092 


5.706 




0.05 


3 


25.84 


100 


-0.139 


5.663 




0.4 


3 


99.97 


98.70 


-0.123 


5.689 


Quality 


0.4 


2 


98.89 


95.07 


-0.134 


5.747 


Of Zkj 


0.4 


1.5 


96.27 


88.37 


-0.140 


5.866 




0.4 


1 


87.19 


80.85 


-0.206 


6.226 




0.4 


0.5 


55.74 


80.93 


-0.242 


6.927 




0.4 





1.85 


98.82 


-0.248 


7.178 



boundaries are smaller than some of those corresponding to the non-boundaries. 
As the constant k 2 decreases, there is a greater overlap between the values of 
the dissimilarity metrics at boundaries and non-boundaries. The limit of k 2 = 
corresponds to a dissimilarity metric with no information, i.e. it has the same 
range of values for boundaries as non-boundaries. The table shows that if the 
dissimilarity metric is nearly perfect (k 2 — 3 corresponds to the means in equa- 
tion ([7]) being separated by 6 standard deviations), then the model nearly always 
correctly identifies boundaries and non-boundaries. However, as the informa- 
tion content in the dissimilarity metric decreases (as k 2 decreases) so does the 
performance of the model, both in terms of the boundary agreement and the 
false positive rate. If the dissimilarity metric contains no information the model 
only identifies 1.86% of the true boundaries as would be expected, although in 
this situation the false positive rate returns to being low at around 1%. 

5 Case study - Cancer risk in Greater Glasgow 

This section presents a study mapping the risk of lung cancer in Greater Glas- 
gow, Scotland, between 2001 and 2005. 

5.1 Data description 

The data for our study are publicly available, and can be downloaded from 
the Scottish Neighbourhood Statistics (SNS) database (http://www.sns.gov.uk). 
The study region is the Greater Glasgow and Clyde health board, which con- 
tains the city of Glasgow in the east, and the river Clyde estuary in the west. 
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Glasgow is known to contain some of the poorest people in Europe (Leyland 



et al. (2007)), and has rich and poor communities that are geographically adja- 
cent. The Greater Glasgow and Clyde health board is partitioned into n = 271 
administrative units called Intermediate Geographies (IG), which were devel- 
oped specifically for the distribution of small-area statistics, and have a median 
area of 124 hectares and a median population of 4,239. 



The disease data we model are the number of people diagnosed with lung 
cancer between 2001 and 2005 in each IG, which corresponds to ICD-10 codes 
C33 - C34. The expected numbers of cases in each IG are calculated by exter- 
nal standardisation, using age and sex adjusted rates for the whole of Scotland. 
These rates were obtained from the Information Services Division (ISD), which 
is the statistical arm of the National Health Service in Scotland. The simplest 
measure of disease risk is the standardised incidence ratio, which is presented in 
Figure [2] as a choropleth map. The Figure shows that the risk of lung cancer is 
highest in the heavily deprived east end of Glasgow (east of the study region), 
as well as along the banks of the river Clyde (the thin white line running south 
east). The Figure also shows that cancer incidence in Greater Glasgow is higher 
than in the rest of Scotland, as the average SIR across the study region is 1.186. 



Large amounts of covariate data are available from the Scottish Neighbour- 
hood Statistics database, and the first variable we consider is a modelled esti- 
mate of the percentage of the population in each IG that smoke, further details 
of which are available from Whyte et al. ( |2007 ). The causal relationship between 



smoking and lung cancer risk is long standing (see for example Doll and Brad- 



ford Hill ( 1950 ) and Doll et al. ( 2005 )), and it is likely to be the most important 



dissimilarity metric in our study. Cancer risk has also been shown to vary by 
ethnic group (see for example National Cancer Intelligence Network (20091), so 
we include the percentage of school children from ethnic minorities as a proxy 
measure. Socio-economic deprivation is also associated with cancer risk (see for 
example Quinn, M and Babb, P (2000) and Woods et al. (2006)), and as a proxy 
measure we use the natural log of the median house price. This proxy measure 
is used because it is the measure of deprivation available that is least correlated 
with the smoking covariate (Pearson's r=-0.69). Finally, we acknowledge that 
other factors such as diet and physical activity have been repeatedly associated 
with lung cancer risk. However, no data are available at the small-area level 
about these factors. 



5.2 Results - modelling 

Inference for all models is based on 50,000 MCMC samples generated from five 
Markov chains, that were initialised at dispersed locations in the sample space. 
Each chain is burnt-in until convergence (40,000 iterations), and the next 10,000 
samples are used for the analysis. A number of models were fitted to the data 
with different combinations of the three dissimilarity metrics, and in each case 
the residuals were assessed for the presence of spatial correlation. This was 
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achieved using a permutation test based on Moran's I statistic (Moran (1950)), 
and in all cases the model adequately removes the spatial correlation present in 
the data, as the corresponding p-values (not shown) are greater than 0.05. 



Initially, all three covariates (smoking, ethnicity and house price) were in- 
cluded in the model as dissimilarity metrics, and the posterior medians and 95% 
credible intervals are displayed in Table [5J Also displayed in Table [2] is a m i n , the 
threshold value below which the dissimilarity metric does not solely detect any 
boundaries in the risk surface. The table shows that smoking prevalence has 
substantial influence in detecting boundaries, as the estimate and credible inter- 
val lie above the threshold value of a m i n = 0.131. In contrast, neither ethnicity 
nor house price have any effects in detecting risk boundaries, as their credible 
intervals both lie below the corresponding 'no effect' thresholds. This suggests 
that the existence of risk boundaries only depend on smoking prevalence, and 
that the other covariates do not add anything to the model. 

To provide more evidence for this, the fit of the four possible models that 
include smoking prevalence as a dissimilarity metric were compared, which in- 
cluded the full model, smoking on its own, and smoking with each of the two 
remaining covariates separately. In all cases both the Deviance Information Cri- 
terion and the number of boundaries detected remained largely unchanged. In 
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Table 2: Covariate effects from the initial model, presented as estimates (pos- 
terior medians) and 95% credible intervals. In addition, the threshold value 
otmin = — ln(0 .5) / z max is presented, the value at which each dissimilarity met- 
ric does not solely identify any risk boundaries. 



Covariate 


Estimate 


95% credible interval 




Smoking 


0.232 


(0.171, 0.254) 


0.131 


Ethnicity 


0.012 


(0.001, 0.046) 


0.126 


House price (log) 


0.015 


(0.001, 0.103) 


0.119 



the smoking only model the sole regression parameter has a posterior median 
and 95% credible interval of 0.257 (0.239, 0.262), which, in common with the 
results in Table [2] lies completely above the 'no effect' threshold. 

5.3 Results - risk maps 

Figure [3] displays the estimated risk surface for lung cancer from the smoking 
only model, where the shading is on the same scale as that used in Figure [2j 
The solid white lines denote the risk boundaries that have been identified by 
the smoking covariate, which are defined by having posterior median values of 
Wkj equal to zero. We note that the study region is split completely in two by 
the river Clyde (the thin white line running south east), and areas on opposite 
banks are not assumed to be neighbours. Therefore no boundaries can be de- 
tected across the river, which explains the absence of boundaries in this area. 

The figure shows that the smoking covariate detects 162 boundaries in the 
risk surface (23.1% of all possible borders), including within the city of Glasgow 
(middle of the map) as well as along the southern coast of the Clyde estuary (far 
west of the map). The majority of these estimated risk boundaries appear to 
correspond to sizeable changes in the risk surface, suggesting that the smoking 
covariate appears to be an appropriate dissimilarity metric for detecting such 
boundaries. However, a few of the boundaries identified show no evidence of 
separating areas with differing health risks, such as the closed boundary in the 
south of the city. These 'false positives' correspond to two areas having different 
smoking prevalences but similar risk profiles, and would be a starting point for 
a more detailed investigation into why the risk profiles are similar given the 
vastly different smoking rates. 

6 Discussion 

In this paper we have proposed a statistical approach to detecting boundaries 
in disease risk maps, which separate populations that exhibit high and low risks 
of disease. Our approach detects boundaries by measuring the dissimilarity be- 
tween populations living in neighbouring areas, because we believe that abrupt 
changes in the risk surface are most likely to occur between populations that 
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Figure 3: Estimated risk surface for lung cancer in Greater Glasgow between 
2001 and 2005. The risk boundaries are denoted by solid white lines. 








are geographically adjacent but have very different social characteristics or risk 
inducing behaviour. Our approach has the advantage of being fully automatic, 
so that unlike the approach of |Lu and Carlin| ( |2005[ ), the number of boundaries 
in the risk surface is determined by the data and not a-priori by the investi- 
gator. In addition, our approach identifies boundaries using a small number 
of regression parameters a, rather than estimating the set of neighbourhood 



relations {wkj} for all pairs of contiguous areas. Li et al.| ( 2011 1 have suggested 
that this latter approach (used by Lu et al. (20071 and Ma and Carlin (2007)) 



can lead to identifiability problems and slow MCMC convergence, due to the 
large number of parameters to be estimated. For example, in the lung cancer 
application presented in Section 5, there are 701 neighbourhood relations Wkj, 
compared with disease data in only 271 areas. 



The simulation study presented in Section 4 suggests that our approach 
generally performs well, both in terms of detecting true boundaries in the risk 
surface, as well as not detecting large numbers of false positives. In the presence 
of a perfect dissimilarity metric our approach can detect the majority of true 
risk boundaries, for example, it has a 99.6% detection rate when the average 
difference in the log risk surface is 0.3. The main drawback of our approach is 
illustrated by the bottom half of Table [T] which shows that it is crucially de- 
pendent on the existence of good quality dissimilarity metrics, that have large 
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values for risk boundaries and small values for non-boundaries. However, in the 
absence of good dissimilarity metrics (as in the last few rows of Table [TJ our 
approach appears to remain conservative, in the sense that the false positive 
rate is relatively low. 

As our approach to boundary detection is based solely on covariates, it has 
the advantage that in addition to identifying risk boundaries, it also identifies 
the underlying drivers of these boundaries. However, as illustrated by the sim- 
ulation study, the corresponding disadvantage is that it relies on the existence 
of relevant covariate data, which may not always be available. Our lung cancer 
example also illustrates this, as the main covariate (smoking prevalence) iden- 
tifies the majority of the risk boundaries evident from Figure [3j However, it 
also identifies some boundaries that do not appear to be real, as well as ignor- 
ing others where there is a suspected discontinuity in the risk surface. This 
imperfection could be caused by the omission of other important dissimilarity 
metrics, or by the fact that the smoking covariate is a modelled estimate rather 
than being real prevalence data. Therefore, the production of risk maps such 
as Figure [3] can be viewed as an exploratory tool, which helps the investigator 
understand the drivers of the spatial variation in disease risk. For example, if 
a risk boundary is not detected despite their being a discontinuity in the risk 
surface, further investigation of the two areas in question could be carried out 
to determine what other factors could be causing this discontinuity. 

Finally, this paper opens up numerous avenues for future work. The most 
obvious is to develop a boundary detection method that has the advantages of 
the method proposed here, but does not rely on covariate data to detect risk 
boundaries. One possibility in this vein is to develop an iterative algorithm, 
which compares the fit (possibly using DIC) of a number of models with different 
but fixed neighbourhood matrices W. In this way one could compare a model 
where all adjacent areas have uikj = 1, against an alternative with Wkj = for 
areas that are suspected of being separated by a discontinuity in the risk surface. 
Other natural extensions to this approach include the detection of boundaries 
in multiple disease risk surfaces simultaneously, as well as adding a temporal 
dimension to the model. 
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